-
Notifications
You must be signed in to change notification settings - Fork 7
Big new design Part 2 :) #307
Conversation
…n prepare_ml_batches.py. Renamed DataSourceList to Manager. Started fleshing out Manager class.
…f_each_example.csv for each split
|
Hi @jacobbieker OK, I think this new PR is just about ready for review. It's still definitely a draft, and there are a bunch of things still to do (please see the tick-list at the top of this PR conversation) but please do take a look at comment on the broad shape of the thing 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how it looks, I think the design is good!
| zipped = list(zip(t0_datetimes, x_locations, y_locations)) | ||
| batch_size = len(t0_datetimes) | ||
|
|
||
| with futures.ThreadPoolExecutor(max_workers=batch_size) as executor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @JackKelly.
You can really see the fruits of your labour when lots of things simplify nicely.
I've added quite a few minor comments, and the only major comment is https://github.com/openclimatefix/nowcasting_dataset/pull/307/files#r740808191, but perhaps I'm missing something there
Ill that you left a few TODOs in there which are too critical. Good to break up the PR with these
|
OK! I'm going to go ahead and merge this now. All the tests pass and I need to fix #325 but that shouldn't stop the script from being used. |
Pull Request
Description
This is quite a big PR, sorry, because it's plugging together the new code!
The majority of the file changes are to get the changes to pass the linter! (Sorry, I really should've done that in a separate PR, if I'd understood how much more strict the new linter config would be!)
Broadly implements an updated version of the design first sketched out in #213 (comment)
Also implements / fixes some other issues which were blocking this PR:
load_solar_pv_data_from_gcs()should usefsspecand hence be able to load data from any compute environment #286split()function #299DataSourceListintoManager; and maintain DataSources in adictinstead of a list? #298prepare_ml_data.py#171The main bulk of this PR does several things (sorry that not all of these are strictly related to implementing issue #213!)
Managerclass which looks after multipleDataSourcesspatial_and_temporal_locations_of_each_example.csvfiles.spatial_and_temporal_locations_of_each_example.csvfiles and start separate processes for eachDataSourceto prepare batches.Manager.create_batches()andManager._get_first_batches_to_create()filesystem.utils.get_maximum_batch_id(), and useglobinstead oflsto be sure that we're searching*.ncinstead of*.utils.get_netcdf_filename(batch_idx)00001.nc? #308filesystem.utils.make_folder()tomakedirs()to be consistent with fsspec. And simplify.dataset/datamodule.pydataset/datasets.pyn_timesteps_per_batchstuff. Done in commit f896a5eprepare_ml_data.pyrunning!DataSource.check_directories()and call fromDataSource.__init__ImplementDataSource.check_paths_exist()for all DataSources. #306batch_to_datasetanddataset_to_batchHow Has This Been Tested?
Checklist: